ABSTRACT

We describe the development of an FX style correlator for Very Long Baseline
Interferometry (VLBI), implemented in software and intended to run in multi-processor
computing environments, such as large clusters of commodity machines (Beowulf clusters)
or computers specifically designed for high performance computing, such as
multi-processor shared-memory machines. We outline the scientific and practical benefits
for VLBI correlation, these chiefly being due to the inherent flexibility of software and
the fact that the highly parallel and scalable nature of the correlation task is well
suited to a multi-processor computing environment. We suggest scientific applications
where such an approach to VLBI correlation is most suited and will give the best returns.
We report detailed results from the Distributed FX (DiFX) software correlator, running on
the Swinburne supercomputer (a Beowulf cluster of ~300 commodity processors), including
measures of the performance of the system. For example, to correlate all Stokes products
for a 10 antenna array, with an aggregate bandwidth of 64 MHz per station and using
typical time and frequency resolution presently requires of order 100 desktop-class
compute nodes. Due to the effect of Moore's Law on commodity computing performance, the
total number and cost of compute nodes required to meet a given correlation task continues
to decrease rapidly with time. We show detailed comparisons between DiFX and two existing
hardware-based correlators: the Australian Long Baseline Array (LBA) S2 correlator, and
the NRAO Very Long Baseline Array (VLBA) correlator. In both cases, excellent agreement
was found between the correlators. Finally, we describe plans for the future operation of
DiFX on the Swinburne supercomputer, for both astrophysical and geodetic science.

Subject headings: Techniques: interferometric — instrumentation: interferometers —
pulsars: general — radio continuum: general — radio lines: general



Introduction 



The technique of Very Long Baseline Interferometry (VLBI), as a means to study the
very high angular resolution structure of celestial radio sources, was developed in the 1960s
(Clark, Cohen & Jauncey 1967; Moran et al. 1967). Some accounts of the early developments
in VLBI, the scientific motivations for the developments, and technical overviews are
given in Finley & Goss (2000).



VLBI, as with all interferometry at radio wavelengths, hinges on the ability to obtain
a digital representation of the electric field variations at a number of spatially separated
locations (radio telescopes), accurately time-tagged and tied to a frequency standard. The
digitised data are transported to a single location for processing (a correlator) and are
coherently combined in order to derive information about the high angular resolution structure
of the target sources of radio emission. The instantaneous angular resolution $R$ of a VLBI
array in arcseconds is given by $R = 2.52 \times 10^{5}\,\lambda/D$, where $\lambda$ is the
wavelength of the radiation being observed (typically centimetres) and $D$ is the maximum
projected baseline (the distance between radio telescopes in the array projected onto a plane
perpendicular to the source; typically thousands of kilometers), with $\lambda$ and $D$
expressed in the same units. This yields typical angular resolutions of order
milliarcseconds.
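
As a rough worked example of the resolution relation, the sketch below evaluates it for an
assumed wavelength of 3.6 cm and an assumed projected baseline of 8000 km. Both values are
illustrative only and are not taken from any experiment described here. The coefficient
2.52 x 10^5 is taken to be 1.22 lambda/D radians expressed in arcseconds, so the wavelength
and the baseline must be given in the same unit.

```python
def vlbi_resolution_arcsec(wavelength_m, baseline_m):
    """Angular resolution R = 2.52e5 * lambda / D, in arcseconds.

    Both arguments must use the same length unit (metres here).
    """
    return 2.52e5 * wavelength_m / baseline_m


# Illustrative (assumed) values: 3.6 cm wavelength, 8000 km projected baseline.
resolution_mas = vlbi_resolution_arcsec(0.036, 8.0e6) * 1.0e3
print(f"{resolution_mas:.2f} mas")
```

For these assumed inputs the result is of order one milliarcsecond, in line with the typical
figure quoted above.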

Traditionally, the "baseband" data (filtered, down-converted, sampled, and quantised
electric field strength measurements; Thompson, Moran & Swenson 1994) generated at each
radio telescope have been recorded to magnetic tape media, for example: the Mark I system
(Bare et al. 1967); the Mark II system (Clark 1973); the Mark III system (Rogers et al.
1983); the Mark IV system (Whitney 1993); and the S2 system (Wietfeldt et al. 1996). After
observation, the tapes from each telescope were shipped to a purpose-built and dedicated
digital signal processor, the correlator. A correlator aligns the recorded data streams, corrects
for various geometrical and instrumental effects, and coherently combines the data from the
different independent pairs of radio telescopes. The correlator output streams, known as the
visibilities, are related to the sky brightness distribution of the radio source essentially via a
Fourier transform relation (Thompson, Moran & Swenson 1994).



The two fundamental operations required to combine or correlate the recorded signals
are a Fourier transform (F) and a cross-multiplication (X). The order of these operations
can be interchanged to obtain the same result, leading to the so-called XF and FX correlator
architectures. A number of well-known descriptions of the theory and practice of radio
interferometry describe the technique in varying degrees of detail and elaborate upon the
differences between XF and FX correlators (Thompson, Moran & Swenson 1994; Romney
1999), and the reader is referred to these texts for the details.



Both XF and FX style correlators have traditionally been highly application-specific
devices, based on purpose-built integrated circuits. In the last 20 years, Field Programmable
Gate Arrays (FPGAs) have become popular in correlator designs, with one prominent example
being the Very Long Baseline Array (VLBA) correlator (Napier et al. 1994). FPGAs are
reconfigurable or reprogrammable devices that offer more flexibility than application-specific
integrated circuits (ASICs) while still being highly efficient.

This paper deals with a departure from the traditional approach of tape-based data
recording and correlation on a purpose-built processor (based on either ASICs or FPGAs).
We have developed a correlator that is based on software known as DiFX (Distributed
FX), which runs within a generic multi-processor computing environment. Such a correlator
interfaces naturally to modern hard-disk data recording systems, such as the MkV system
(Whitney 2002) and the K5 system (Kondo et al. 2003), that have now largely replaced
tape-based recording systems. Specifically, we have developed this software correlator to support
a new disk-based VLBI recording system that has been deployed across the Australian Long
Baseline Array^1 (LBA) for VLBI. We refer the reader to a detailed discussion of the LBA
hard-disk recording system (LBADR) that appears elsewhere (Phillips et al. 2007, in
preparation). As our software correlator is more broadly applicable than to just the LBA, we will
not dwell on the details of the LBA recording system in this paper, but rather concentrate on
the characteristics, benefits, and performance of our software correlator, giving brief details of
the recording system when required. The correlator source code, binaries and instructions for
use are available for download from http://astronomy.swin.edu.au/~adeller/software/difx/.

The very first VLBI observations were in fact correlated using software on a mainframe
computer. Software correlators were developed simultaneously on an IBM 360/50
at the National Radio Astronomy Observatory (NRAO) (Bare et al. 1967) and on an IBM
360/92 at the Goddard Space Flight Centre (Moran et al. 1967). As the early experiments
quickly increased in complexity the recorded data volume also increased and it became
necessary to design custom hardware for VLBI correlation. Recent examples of such correlators
include: the NRAO Very Long Baseline Array correlator (Napier et al. 1994); the Joint
Institute for VLBI in Europe (JIVE) correlator (Casse 1999); the Canadian NRC S2 correlator
(Carlson et al. 1999); the Japanese VLBI Space Observatory Programme (VSOP) correlator
(Horiuchi et al. 2000); and the Australia Telescope National Facility (ATNF) S2 correlator
(Wilson, Roberts & Davis 1996). Table 1 compares some of the basic properties of some
currently-operational hardware VLBI correlators.

^1 http://www.atnf.csiro.au/vlbi

Recently, the pace of development of commodity computing equipment (processors, stor- 
age, networking etc) has outstripped increases in VLBI computational requirements to the 
point that the correlation of VLBI data using relatively inexpensive supercomputer facili- 
ties is feasible. The correlation algorithm is "embarrassingly parallel" and very well suited 
to such parallel computing architectures. These facilities are not purpose-built for corre- 
lation but are inherently multi-purpose machines, suited to a wide range of computational 
problems. 

This approach to correlation gives rise to significant scientific benefits, under certain 
circumstances. The benefits stem from the basic characteristics of correlation, software engi- 
neering considerations, and the computing environments. Software is more flexible and easier 
to redesign than application-specific hardware or even FPGA-based processors (although the 
programming tools for FPGAs are developing rapidly). The highly parallel nature of the 
correlation problem, coupled with the availability of high-level programming languages and 
optimised vector libraries means that a reasonably general software correlator code can be 
written quickly and be used in a variety of different computing environments with minimum 
modification, or in a dynamic environment where computing resources and/or significant 
scientific requirements can change rapidly with time. 

However, the trade-off for flexibility and the convenience of high-level programming tools 
is reduced efficiency for any given task, compared to an application-specific or FPGA-based 
solution. Put simply, the Non-Recoverable Engineering (NRE) costs for a software correlator 
are much lower than for a hardware correlator, but the cost per unit processing power is 
higher. Thus, the limited computation needed by a small size correlator means a software 
approach will be cheaper overall, while the tremendous computational requirements of corre- 
lators on the scale required for the Expanded Very Large Array (EVLA) or Atacama Large 
Millimetre Array (ALMA) dictate that the substantial amounts of NRE spent optimising 
hardware are worthwhile, at least in 2006. 

Software also has an advantage over hardware if the additional support required for 
unusual or stringent VLBI experiments is impossible or impractical to implement in an 
existing hardware correlator. An example of this is given in §4.3. Use of a software correlator
in these cases, even at possibly reduced efficiency, is preferable to the expense of building or 
altering dedicated hardware. 



A good example of the flexibility of software correlation and its trade-off with efficiency is
spectral resolution capability. A generic modern CPU is capable of calculating multi-million
point one-dimensional Fast Fourier Transforms (FFTs), allowing an FX style software
correlator utilising this CPU as a processing element to give extremely high frequency resolution:
a million spectral points across the frequency bandwidth of an observation.

Such a correlation would be computationally intensive, as conventional CPUs are not 
optimised for such operations. However, it could be carried out using exactly the same 
software and hardware as is used for a generic continuum experiment. Comparison to Table
1 shows that such high spectral resolution is currently impossible on existing hardware
correlators. A number of limitations on particular hardware correlator implementations, 
such as minimum integration times, maximum input data rates, and maximum output data 
rates, can be overcome in a similar fashion with software correlators. 

The flexibility, inexpensive nature, and ease of production of software correlators makes
them particularly useful for small to medium sized VLBI arrays, since development times
are short, costs are low, and the capabilities are high, providing niche roles for even small
facilities. These factors have led to a resurgence in software correlator applications in a
number of groups around the world. In addition to the efforts described here at the
Swinburne University of Technology, a group have developed a software correlator, mainly for
geodetic VLBI, at the Communications Research Laboratory (CRL) in Japan (Kondo et al.
2003). This CRL code is also used for real-time fringe checks during observations on the
European VLBI Network (EVN), operated from JIVE^2. Also at JIVE, a software correlator
has been developed and used to process VLBI observations that tracked the Huygens probe
as it entered the atmosphere of Titan (Pogrebenko et al. 2003). Spacecraft tracking with
VLBI and software correlation is likely to become a more recognised technique following the
Huygens success, for example for the Chinese Chang'E lunar mission^3. Finally, the most
ambitious example of a software correlator is the Low Frequency Array (LOFAR) correlator,
which is implemented on an IBM BlueGene/L supercomputer containing 12,000 processors^4.
This software correlator rivals the most powerful hardware correlators currently operating
or in the design stage, but differs from the software correlator described in this paper in that
hardware specific optimisations and large amounts of NRE were utilised.

The approach we used in the development of the software correlator was largely inspired
by the previous success of a group at Swinburne who developed baseband signal processing
software for multi-processor environments, for the purposes of pulsar studies (Bailes 2003).
A prototype software correlator developed at Swinburne is described in West (2004), with
initial results described in Horiuchi et al. (2006).

^2 Details about the process and results can be found at http://www.evlbi.org/evlbi/tevlb8/tevlb8.html
^3 http://en.cast.cn
^4 http://www.lofar.org



In this paper we concentrate on a description of the DiFX software correlator for VLBI 
developed at the Swinburne University of Technology, motivated by the factors discussed 
above. This correlator has been used as part of the Australia Telescope National Facility 
(ATNF) VLBI operations since 2005 and has now replaced the previously used ATNF S2 
correlator. The particular architecture we have adopted (§2.1, 2.2 and 2.3) is discussed only
briefly, as the correlation algorithm has been discussed at length in the literature. §3 describes 
the DiFX correlator, including the details of the software implementation, verification results 
from comparisons with two established hardware correlators, and performance figures-of- 
merit. We illustrate some examples of specific scientific applications that can benefit from 
software correlation in §4. Finally, our conclusions are presented in §5. 



The FX software correlator architecture 



Many previous works develop in detail the theory of radio interferometry (Thompson, Moran
& Swenson 1994; Thompson 1999). The reader is referred to these texts for a complete
discussion of the technique. Here we discuss the main steps used to implement the correlator
architecture (FX) that we have adopted.



A more extensive overview of correlator operations is given in Romney (1999). We
do not describe the operations at the telescopes that convert the incident electric field at 
sky frequency to the filtered, down-converted, sampled, and digitised data streams that are 
recorded to disk (baseband data in our terminology). 

A number of the initial operations are made on the telescope-based data streams. A 
number of the later operations are baseline-based. These two sets of operations are briefly 
described separately and in sequence. 



2.1. Antenna-based operations 

2.1.1. Alignment of telescope data streams 

To correlate data from a number of different telescopes, the changing delays between 
those telescopes must be calculated and used to align the recorded data streams at a prede- 
termined point in space (in this case the geocentre) throughout the experiment. 
The Swinburne software correlator uses CALC^5 to generate a geometric delay model
($\tau(t)$) for each telescope in a given observation, at regular intervals (usually 1 second).
CALC models many geometric effects, including precession, nutation, ocean and atmospheric 
loading, and is used by many VLBI correlators including the VLBA and JIVE correlators. 
These delays are then interpolated (using a quadratic approximation) to produce accurate 
delays ($\Delta\tau < 1 \times 10^{-15}$ sec, compared to an exact CALC value) in double precision for any
time during the course of the observation. The estimated station clock offsets and rates are
added to the CALC-generated geometric delays. 
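
The interpolation step can be illustrated with a minimal sketch. It assumes three equally
spaced model delays and a Lagrange quadratic through them, and it adds a linear clock term
referenced to the first model epoch. The function names, the clock reference epoch and the
sinusoidal stand-in for CALC are all assumptions made for illustration; they are not the
DiFX implementation or the real CALC model.

```python
import math


def interpolate_delay(t, t0, dt, d0, d1, d2, clock_offset=0.0, clock_rate=0.0):
    """Quadratic (Lagrange) interpolation through three equally spaced model
    delays d0, d1, d2 at times t0, t0 + dt, t0 + 2*dt, plus a linear clock
    term (offset and rate are hypothetical parameters, referenced to t0)."""
    x = (t - t0) / dt
    geometric = (d0 * (x - 1.0) * (x - 2.0) / 2.0
                 - d1 * x * (x - 2.0)
                 + d2 * x * (x - 1.0) / 2.0)
    return geometric + clock_offset + clock_rate * (t - t0)


def toy_model(t):
    """Toy stand-in for CALC (assumption): a 20 ms delay amplitude varying
    sinusoidally at the Earth's rotation rate."""
    return 0.02 * math.cos(7.292e-5 * t)


t0, dt = 10000.0, 1.0
samples = [toy_model(t0 + i * dt) for i in range(3)]
t = t0 + 0.37
error = interpolate_delay(t, t0, dt, *samples) - toy_model(t)
print(f"interpolation error: {error:.2e} s")
```

For this toy model the interpolation error is far smaller than one sample period at typical
VLBI sampling rates, which is why a low-order polynomial between 1 second model points can
be adequate.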

The baseband data for each telescope are loaded into large buffers in memory, and the 
interpolated delay model is used to calculate the accurate delay between each telescope and 
the centre of the Earth at any given time during the experiment. This delay, rounded to 
the nearest sample, is the integer sample delay. The difference between the delay and the 
integer sample delay is recorded as the antenna based fractional sample delay (up to ± 0.5 
sample). Note that the alignment of any two data streams (as opposed to a data stream 
alignment with the geocentre) is good to ± 1 sample. 
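
A minimal sketch of the split into integer and fractional sample delays follows. The
function name and the example delay and sample rate are assumptions chosen for illustration.

```python
import math


def split_delay(delay_s, sample_rate_hz):
    """Split a delay into a whole number of samples and a fractional remainder.

    The remainder lies in [-0.5, 0.5) sample, matching the +/- 0.5 sample
    range described in the text.
    """
    delay_samples = delay_s * sample_rate_hz
    integer = math.floor(delay_samples + 0.5)  # nearest whole sample
    fractional = delay_samples - integer
    return integer, fractional


# Assumed example: a delay of about 10 ms sampled at 32 Msample/s.
whole, frac = split_delay(0.0100000094, 32.0e6)
print(whole, round(frac, 4))
```

Because each stream carries its own remainder of up to half a sample, two streams aligned
this way can differ by up to one sample, as noted above.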

The integer-sample delay is used to offset the data pointer in memory and select the 
data to be correlated (some number of samples which is a power of 2, starting from the time 
of alignment). The fractional sample error is retained to correct the phase as a function 
of frequency following alignment to within one sample, fringe rotation, and channelisation 
(§2.1.3).

Once the baseband data for each telescope have been selected, they are transferred to 
a processing node and unpacked from the coarsely quantised representation (usually a 2-bit 
representation) to a floating point (single precision) representation. From this point on, all 
operations in the correlator are performed using floating point arithmetic, in single precision 
unless otherwise specified. Note that the data volume is expanded by a factor of 16 at 
this point. The choice of single precision floats (roughly double the precision necessary) 
was dictated by the capabilities of modern CPUs, which process floats efficiently. Using 
sufficient precision also avoids the small decorrelation losses incurred by optimised, low 
precision operations often used in hardware correlators. This is a good example of the 
sacrifice of efficiency for simplicity and accuracy with a software correlator. 
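
The unpacking step can be sketched as below. The bit layout (four samples per byte, least
significant bits first) and the four voltage levels are assumptions made for illustration;
real recording formats such as MkV, LBA or K5 each define their own layout and level
encoding.

```python
# Hypothetical 2-bit level table: index is the 2-bit code, value is the voltage.
LEVELS = (-3.3359, -1.0, 1.0, 3.3359)


def unpack_2bit(packed):
    """Expand packed 2-bit samples (bytes) into a list of floats.

    Assumes four samples per byte, with the first sample in the two least
    significant bits.
    """
    samples = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            samples.append(LEVELS[(byte >> shift) & 0b11])
    return samples


unpacked = unpack_2bit(bytes([0b11100100]))
print(unpacked)
```

Each input byte yields four values. Stored as 4-byte single precision floats these occupy
16 bytes, which is the factor of 16 expansion mentioned above.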

At this point all data streams from all telescopes are aligned to within ± 1 sample
of each other and the fractional sample errors for each of the telescope data streams are
recorded for later use. A set number of samples from each telescope data stream have been
selected and are awaiting processing on a common processing node (e.g. a PC in a Beowulf
cluster).

^5 http://gemini.gsfc.nasa.gov/solve



2.1.2. Fringe rotation 

Fringe rotation compensates for the changing phase difference introduced by delaying 
the signal from each telescope to the geocentre after it has been downconverted to baseband 
frequencies. If the changing delay, $\tau(t)$, could be compensated for at sky frequency, fringe
rotation would not be required. This, however, is impractical. 

The necessary fringe rotation function can be calculated at any point in time by taking 
the sine and cosine of the geocentric delay multiplied by the sky frequency $\nu_0$; it is applied
via a complex multiplication for each telescope's data stream. 
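
A minimal sketch of time-domain fringe rotation is given below. It assumes the rotation
phase is 2 pi times the product of sky frequency and delay, and it assumes a negative sign
convention; the sign actually required depends on the sideband and on the delay convention,
and the function name is hypothetical.

```python
import cmath
import math


def fringe_rotate(samples, delays_s, sky_freq_hz):
    """Multiply each sample by exp(-2*pi*i * nu_0 * tau), one delay per sample."""
    rotated = []
    for sample, tau in zip(samples, delays_s):
        turns = sky_freq_hz * tau
        # Only the fractional part of a turn matters; removing whole turns
        # before the trigonometric call keeps the argument small.
        turns -= math.floor(turns)
        rotated.append(sample * cmath.exp(-2j * math.pi * turns))
    return rotated


# Assumed example: a 1 GHz sky frequency and a delay of a quarter of a period.
out = fringe_rotate([1.0], [0.25e-9], 1.0e9)
print(out[0])
```

The real and imaginary parts of the exponential are the cosine and sine terms referred to
above, so the output is complex even though the input samples are real.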

Since the baseband data have already been unpacked to a floating point representation
by this stage, a floating point fringe rotation is applied which yields no fringe rotation losses,
compared, for example, to a 6.25% loss of signal to noise for three level digital fringe rotation
in a two level complex correlator (Roberts 1997).



Implemented as such, fringe rotation represents a mixing operation and will result in a 
phase difference term which is quasi-stationary at zero phase (the desired term) and a phase 
sum term which has a phase rate of twice the fringe rotation function, $\sim 4\pi\nu_0\tau(t)$. The sum
term vector averages to a (normally) negligible contribution to the correlator; for typical 
VLBI fringe rates (100s of kHz) and integration times (seconds) the relative magnitude of 
the unwanted contribution to each visibility point is $< 10^{-5}$. In a software correlator it
would be simple to control the integration time so that the rapidly varying phase term is 
integrated over exactly an integral number of turns of phase, thus making no contribution
to the correlator output. This feature is not currently implemented in DiFX. 

We have thus far described fringe rotation as a phase shift for each sample in the time 
domain. If performed in this manner, we refer to the fringe rotation as "pre-F" (under an 
FX architecture), as it has been applied before the transformation to the frequency domain 
in the channelisation process (§2.1.3). In this case, the geometric delay for each sample is
interpolated using the delay model as described in §2.1.1 above.

In cases where the fringe rotation to be applied changes little from the first sample in the 
FFT window to the last, a minimal amount of decorrelation is introduced by applying a single 
fringe rotation for the entire window. The decorrelation can be estimated by
$\mathrm{sinc}(\Delta\phi/2)$, where $\Delta\phi = 2\pi\nu_0\Delta\tau$ is the change in baseline
phase due to Earth rotation over the FFT window, $\Delta\tau$ being the corresponding change in delay.
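
The size of this effect can be estimated with the sketch below. It assumes the unnormalised
convention sinc(x) = sin(x)/x and approximates the change in delay over the window as the
delay rate multiplied by the window length. The numerical inputs are illustrative
assumptions and are not values from Table 2.

```python
import math


def post_f_coherence(sky_freq_hz, delay_rate, fft_window_s):
    """Fraction of the amplitude retained when one fringe rotation value is
    used for a whole FFT window: sinc(dphi / 2), with sinc(x) = sin(x) / x."""
    dphi = 2.0 * math.pi * sky_freq_hz * delay_rate * fft_window_s
    half = dphi / 2.0
    return 1.0 if half == 0.0 else math.sin(half) / half


# Assumed example: 8.4 GHz sky frequency, delay rate of 1 microsecond per
# second, and a 16 microsecond FFT window.
retained = post_f_coherence(8.4e9, 1.0e-6, 16.0e-6)
print(f"retained fraction: {retained:.4f}")
```

For these assumed inputs roughly 3% of the amplitude is lost, which illustrates why a single
rotation per window is only suitable when the fringe rate is low or the window is short.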






In this way, fringe rotation can be applied after channelisation, which saves considerable
computational effort ("post-F" fringe rotation). For this approach to be viable, the fringe
rates should be low (i.e. low frequencies and/or short baselines) and the number of channels
should be small (implying that the time range of the samples to be correlated is short
compared to the fringe period). Table 2 shows the degree of decorrelation which would be
incurred by utilising post-F fringe rotation for a range of VLBI observation modes. This
decorrelation is simple to calculate and could be used to correct the visibility amplitudes and
alter visibility weights, although this is not presently implemented in DiFX. It is important
to note that the use of post-F fringe rotation is not recommended for all situations shown
in Table 2 and indeed is only intended for use when the resultant decorrelation is $\ll 1\%$.

Post-F fringe rotation is desirable in situations where the fringe rate is extremely low, 
when the double-frequency term introduced by the mixing operation of pre-F fringe rotation 
is not effectively averaged to zero over the course of an integration and makes a significant 
and undesirable contribution to the correlator output. Switching from pre-F to post-F 
fringe rotation would be beneficial for periods of time in most experiments when the source 
traverses periods of low phase rate. Sources near a celestial pole can have very low fringe 
rates for long periods of time. Alternatively, if very short correlator integration times are 
used, the sum term may not integrate to zero when using pre-F fringe rotation. Post-F 
fringe rotation would therefore be a natural choice in these circumstances. 

It should be noted that it is possible to undertake the exact equivalent to pre-F fringe 
rotation in the frequency domain. However, this would involve the Fourier transform of 
the fringe rotation function and a convolution in the frequency domain, which is at least as 
computationally intensive as the complex multiplication of the data and fringe rotation in 
the time domain. 

DiFX implements pre-F or post-F fringe rotation as a user controlled option. 



2.1.3. Channelisation and fractional sample error correction 



Once the data are aligned and phase corrected after fringe rotation, the time series data 
are converted into frequency series data (channelised), prior to cross multiplication. 

Channelisation of the data can be accomplished using an FFT (Fast Fourier Transform)
or a digital filterbank. If used, the filterbank is implemented in a polyphase fashion, which
essentially inserts a decomposed filter before an FFT (Bellanger & Daguet 2004). This allows
the channel response to be changed from the $\mathrm{sinc}^2$ response natural to an FX correlator to
any desired function. In practice, an approximation to a rectangle is applied, although the
length of the filter (and hence the accuracy of the approximation) is tunable.

If pre-F fringe rotation has been applied, the data are already in complex form, and 
so a complex-to-complex FFT is used. The positive or negative frequencies are selected in 
the case of upper or lower sideband data respectively. If post-F fringe rotation is to be 
applied, the data are still real and so a more efficient real-to-complex FFT may be used. 
This is possible due to the conjugate symmetry property of an FFT of a real data series. In 
this case, lower sideband data may be recovered by reversing and conjugating the resultant 
channels. 
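
The conjugate symmetry that makes the real-to-complex transform possible can be checked
with a small sketch. A direct (slow) discrete Fourier transform is used here purely for
clarity; it is not how a production correlator would compute the transform, and the
function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform, O(n^2), for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_to_complex(x):
    """Return only the n/2 non-redundant channels of a real input series."""
    return dft(x)[: len(x) // 2]


series = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
full = dft(series)
# For real input, channel n-k is the complex conjugate of channel k.
worst = max(abs(full[len(series) - k] - full[k].conjugate())
            for k in range(1, len(series)))
print(f"largest symmetry violation: {worst:.1e}")
```

Since the upper half of the spectrum carries no new information for real input, an FFT of
length twice the number of spectral points suffices.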



The final station-based operation is fractional-sample correction (Romney 1999). This
step is considerably easier in an FX correlator than an XF implementation, since the con- 
version to the frequency domain before correlation allows the fractional error to be corrected 
exactly, assuming the error to be constant over an FFT length. This is equivalent to the 
assumption made for post-F fringe rotation, but is considerably less stringent since the phase 
change is proportional to the subband bandwidth, rather than sky frequency as in the case 
of fringe rotation. The frequency domain correction manifests itself as a slope in the phase 
as a function of frequency across the observed bandwidth. 

Thus, after channelisation, a further complex multiplication is applied to the channels, 
correcting the fractional sample error. In the case of post-F fringe rotation, the fringe 
rotation value is added to the fractional-sample correction and the two steps are performed 
together. 
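
The fractional-sample correction can be sketched as a phase slope across the channels. The
sketch assumes that channel k of nchan sits at a frequency of k/(2 nchan) cycles per sample
(the lower channel edge) and assumes a negative sign convention. Both choices, and the
function name, are illustrative assumptions rather than the DiFX conventions.

```python
import cmath
import math


def correct_fractional_delay(channels, frac_samples, sign=-1.0):
    """Apply a linear phase slope across the band for a residual delay given
    in units of samples."""
    nchan = len(channels)
    corrected = []
    for k, value in enumerate(channels):
        # Channel k sits at k / (2 * nchan) cycles per sample, so a delay of
        # frac_samples gives a phase of pi * k * frac_samples / nchan radians.
        phase = sign * math.pi * k * frac_samples / nchan
        corrected.append(value * cmath.exp(1j * phase))
    return corrected


# Assumed example: four channels and the largest possible error of 0.5 sample.
sloped = correct_fractional_delay([1.0] * 4, 0.5)
print([round(cmath.phase(v), 4) for v in sloped])
```

With an error of half a sample the phase approaches a quarter turn at the top of the band,
far smaller than the phase change that fringe rotation must track at sky frequency.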

Either simple FFT or digital polyphase filter bank channelisation can be selected as a 
user controlled option in DiFX. 



2.2. Baseline-based operations 

2.2.1. Cross multiplication of telescope data streams 

For each baseline, the channelised data from the telescope pair are cross-multiplied on a 
channel by channel basis (after forming the complex conjugate for the channelised data from 
one telescope) to yield the frequency domain complex visibilities that are the fundamental 
observables of an interferometer. This is repeated for each common band/polarisation on a 
baseline, and for all baselines. If dual polarisations have been recorded for any given band, 
the cross-polarisation terms can also be multiplied, allowing polarisation information for the 
target source to be recovered. 
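
The baseline-based cross multiplication can be sketched as follows for a single band and
polarisation. The dictionary layout, station labels and function name are assumptions made
for illustration.

```python
from itertools import combinations


def cross_multiply(spectra):
    """Form visibilities for every baseline.

    spectra maps a station name to its list of complex channel values.
    Returns a dict keyed by (station_a, station_b), holding a * conj(b)
    channel by channel.
    """
    visibilities = {}
    for a, b in combinations(sorted(spectra), 2):
        visibilities[(a, b)] = [x * y.conjugate()
                                for x, y in zip(spectra[a], spectra[b])]
    return visibilities


example = cross_multiply({"A": [1 + 1j, 2 + 0j], "B": [1j, 1 - 1j]})
print(example[("A", "B")])
```

The number of baselines grows as N(N-1)/2, so a 10 antenna array produces 45 baselines for
each band and polarisation product.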



2.2.2. Integration of correlated output



Once the above cycle of operations has been completed, it is repeated and the resulting 
visibilities accumulated (complex added) until a set accumulation time has been reached. 
The number of "good" cycles per telescope is recorded, which could form the basis of a 
data weighting scheme, although weights are not currently recorded in DiFX. Generally, 
on each cycle the input time increment is equal to the corresponding FFT length (twice 
the number of spectral points), but it is also possible to overlap FFTs. This allows more 
measurements of higher lags and greater sensitivity to spectral line observations, at the cost 
of increased computation. In this way, the limiting time accuracy with which accumulation 
can be performed is equal to the FFT length divided by the overlap factor. A caveat to this 
statement is discussed in §3.4.



2.2.3. Calibration for nominal telescope $T_{\mathrm{sys}}$



Cross multiplication, accumulation and normalisation by the antenna autocorrelation
spectra gives the complex cross power spectrum for each baseline, representing the correlated
fraction of the geometric mean of the powers detected at each telescope. To obtain the
correlated power in units of Jy, the cross power spectra (amplitude components) should be
scaled by the geometric mean of the powers received at each telescope measured in Jy, i.e.
the $T_{\mathrm{sys}}$ in Jy routinely measured at each antenna. Calibration based on the measured
$T_{\mathrm{sys}}$ is typically performed as a post-correlation step in AIPS^6 or a similar data
analysis package, and so a nominal value for the $T_{\mathrm{sys}}$ for each telescope is applied at the
correlator. In addition, a scaling factor to compensate for decorrelation due to the coarse
quantisation of the baseband data is applied. This corrects the visibility amplitudes, but of
course cannot recover the lost signal to noise. For the 2-bit data typically processed, this
scaling factor is 1/0.88 in the low-correlation limit (Cooper 1970). The relationship becomes
non-linear at high correlation and the scaling factor approaches unity as the correlation
coefficient approaches unity. The correction for high-correlation cases can be applied in
post-processing, generally at the same time as the application of measured $T_{\mathrm{sys}}$ values.
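
The nominal amplitude scaling can be sketched as below. It assumes the system temperatures
are already expressed in Jy and applies the low-correlation 2-bit factor of 1/0.88 quoted
above. The function name, parameter names and example values are hypothetical, and the
sketch does not model the non-linear high-correlation regime.

```python
import math


def scale_to_jy(correlation_coeff, tsys_a_jy, tsys_b_jy, quantisation_efficiency=0.88):
    """Scale a normalised correlation coefficient to Jy using the geometric
    mean of the two system temperatures (in Jy) and a low-correlation
    quantisation correction."""
    geometric_mean = math.sqrt(tsys_a_jy * tsys_b_jy)
    return correlation_coeff * geometric_mean / quantisation_efficiency


# Assumed example: 0.1% correlation with nominal values of 100 Jy and 400 Jy.
flux = scale_to_jy(0.001, 100.0, 400.0)
print(f"{flux:.4f} Jy")
```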



^6 http://www.aoc.nrao.edu/aips
2.2.4. Export of visibility data

Once an accumulation interval has been reached, the visibilities must be stored in a
useful format. Presently, the software correlator supports RPFITS^7 as the output format.
RPFITS files can be loaded into analysis packages such as AIPS, CASA^8 or MIRIAD^9
for data reduction. Ancillary information is included in the RPFITS file along with the
complex visibilities, time stamps, and (u,v,w) coordinates. The RPFITS standard supports
the appending of a data weight to each spectral point, but DiFX does not currently record
weights. In the future, it is planned to add additional widely used output formats, such as
FITS-IDI^10.



2.3. Special processing operations: pulsar binning 

Pulsed signals are dispersed as they travel through the interstellar medium (ISM),
resulting in a smearing of the pulse arrival time in frequency. In order to correct for the
dispersive effects of the ISM, DiFX employs incoherent dedispersion (Voûte et al. 2002).
This allows the visibilities generated by the correlator to be divided into pulse phase bins.
Unlike hardware correlators which typically allow only a single on/off bin, or else employ
$2^N$ bins of fixed width, DiFX allows an arbitrary number of bins placed at arbitrary phase
intervals. The individual bins can be written out separately in the RPFITS file format to
enable investigation of pulse phase dependent effects, or can be filtered within the correlator
based on a priori pulse profile information.

To calculate which phase bin a visibility at a given frequency and time corresponds
to, the software correlator requires information on the pulsar's ephemeris, which is supplied
in the form of one or more "polyco" files containing a polynomial description of apparent
pulse phase as a function of time. These are generated using the pulsar analysis program
TEMPO^11, and require prior timing of a pulsar. Additional software has been written by
the authors to verify the pulsar timing, using the generated polyco files and the baseband
data (in MkV, LBA or K5 format) from an experiment, allowing phase bins to be accurately
set before correlation.

^7 http://www.atnf.csiro.au/computing/software/rpfits.html
^8 http://casa.nrao.edu/
^9 http://www.atnf.csiro.au/computing/software/miriad
^10 http://www.aoc.nrao.edu/aips/FITS-IDI.html
^11 http://pulsar.princeton.edu/tempo/reference_manual.html

For VLBI observations of pulsars, it is usually desirable to maximise the signal to noise 
of the observations by binning the visibilities based on the pulse phase, and applying a filter 
to the binned output based on the signal strength in that phase. Typically this filter is 
implemented as a binary on/off for each phase bin. Using the pulse profile generated from 
the baseband data of an observation, however, DiFX allows a user-specified number of bins to 
be generated and a filter applied based on pulse strength × bin width, allowing the maximum
theoretical retrieval of signal, as described below. This also reduces the output data volume, 
since only an "integrated on-pulse" visibility is retained, rather than potentially many phase 
bins. 

Consider observing a single pulse, divided into M equally spaced phase bins. Let the
pulsar signal strength as a function of phase bin be S(m), and the noise in a single phase bin
be Z × √M, where Z is the baseline sensitivity for an integration time of a single pulse
period. When all bins are summed (effectively no binning), the S/N ratio will be:

$$\frac{\sum_{m=0}^{M-1} S(m)}{Z \times M}$$

as the signal adds coherently while the noise adds in quadrature. For a simple on/off gate
accepting only bins m_1 to m_2, the S/N ratio will be:

$$\frac{\sum_{m=m_1}^{m_2} S(m)}{Z \times \sqrt{M \times (m_2 - m_1 + 1)}}$$

Finally, for the case where each bin is weighted by the pulse signal strength in that bin,
the S/N ratio will be:

$$\frac{\sum_{m=0}^{M-1} S(m)^2}{\sqrt{\sum_{m=0}^{M-1} \left(S(m) \times Z \times \sqrt{M}\right)^2}}$$

For a Gaussian shaped pulse, this allows a modest improvement in recovered signal to 
noise of 6% compared to an optimally placed single on/off bin. On a more complicated 
profile, such as a Gaussian main pulse with a Gaussian interpulse at half the amplitude, the 
improvement in recovered signal to noise increases to 21%. 
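
The three cases described above (no binning, a single on/off gate, and weighting each bin by the pulse strength) can be compared numerically. The sketch below is illustrative only: it assumes M bins indexed 0 to M-1, a per-bin noise of Z × √M, and a hypothetical Gaussian profile whose width and position were chosen for the example. The optimal gate is found by brute force over all contiguous bin ranges.

```python
import math

def snr_unbinned(S, Z):
    """S/N when all M bins are summed (no binning)."""
    return sum(S) / (len(S) * Z)

def snr_gate(S, Z, m1, m2):
    """S/N for an on/off gate accepting bins m1..m2 inclusive."""
    n_on = m2 - m1 + 1
    return sum(S[m1:m2 + 1]) / (Z * math.sqrt(len(S) * n_on))

def snr_weighted(S, Z):
    """S/N when each bin is weighted by the pulse strength in that bin."""
    per_bin_noise = Z * math.sqrt(len(S))
    signal = sum(s * s for s in S)
    noise = math.sqrt(sum((s * per_bin_noise) ** 2 for s in S))
    return signal / noise

def best_gate(S, Z):
    """Brute-force search for the contiguous gate with the highest S/N."""
    M = len(S)
    prefix = [0.0]
    for s in S:
        prefix.append(prefix[-1] + s)  # prefix sums make each gate O(1)
    best = (0.0, 0, 0)
    for m1 in range(M):
        for m2 in range(m1, M):
            snr = (prefix[m2 + 1] - prefix[m1]) / (Z * math.sqrt(M * (m2 - m1 + 1)))
            if snr > best[0]:
                best = (snr, m1, m2)
    return best

def gaussian_profile(M, centre, sigma):
    """Hypothetical Gaussian pulse profile sampled in M phase bins."""
    return [math.exp(-0.5 * ((m - centre) / sigma) ** 2) for m in range(M)]

profile = gaussian_profile(256, 128.0, 8.0)
gate_snr, m1, m2 = best_gate(profile, Z=1.0)
gain = snr_weighted(profile, 1.0) / gate_snr
print(f"best gate: bins {m1}-{m2}; weighted / gated S/N = {gain:.3f}")
```

Both Z and M cancel in the ratio of weighted to gated S/N, so the gain depends only on the shape of the pulse profile; for a well-sampled Gaussian it comes out at roughly the 6% level quoted above.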

3. Software correlation on the Swinburne Beowulf cluster - a case study

3.1. The cluster computing environment 

The Swinburne University of Technology supercomputer is a ~300 processor Beowulf
cluster comprising a mixture of commodity off-the-shelf desktop and server style PCs, connected
via a gigabit ethernet network. In particular, the supercomputer has five sub-clusters, each
with 48 machines. Four sub-clusters are made up of single processor 3.2 GHz Pentium 4 PCs
with 1 GB of RAM per machine, while one sub-cluster is made up of dual processor Xeon
servers, each with 2 GB of RAM per machine. The cluster is continuously upgraded and 
fully replaced approximately every 3-4 years. The software correlation code must operate in 
this multi-user, multi-tasking, and highly dynamic environment. 



3.2. Structure of the DiFX code 

DiFX is written in C++, but makes heavy use of the optimised vector processing routines
provided by the Intel Performance Primitive (IPP) library. The use of this optimised vector
library results in a factor of several performance gain on the Intel CPUs, compared to
non-optimised vector code. Data transfer is handled via the Message Passing Interface (MPI)
standard. The mpich implementation of MPI is used.

Figure [1] shows the high-level class structure of DiFX, along with the data flow. The
correlation is managed by a master node (FxManager), which instructs data management
nodes (Datastream) to send time ranges of baseband data to processing nodes (Core). The
data are then processed by the Core nodes, and the results sent back to the FxManager.
Double buffered, non-blocking communication is used to avoid latency delays and maximise
throughput. Both the Datastream and Core classes can be (and have been) extended to
allow maximum code re-use when handling different data formats and processing algorithms.
The Core nodes make use of an allocatable number of threads to maximise performance on
a heterogeneous cluster.

The Datastream nodes can read the baseband data into their memory buffers from a local
disk, a network disk or a network socket. Once the data are loaded into the datastream buffer,
the remainder of the system is unaware of its origin. This is one of the most powerful aspects
of this correlator architecture, meaning the same correlator can easily be used for production
disk-based VLBI correlation and real-time eVLBI testing, where the data are transmitted in
real time from the telescopes to the correlator over optical fibre. Real-time eVLBI operational
modes have been tested using DiFX, transmitting data in real time from the three ATNF
telescopes (Parkes, ATCA, and Mopra) to computing resources at the Swinburne University
of Technology and the University of Western Australia in Perth (a Cray XD-1 utilising
Opteron processors and on-board Xilinx FPGAs). The software correlator then correlates
the transmitted data in real time. A full account of the new eVLBI capabilities of the
Australian VLBI array will be presented elsewhere (Phillips et al. 2007, in preparation).

IPP: http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/index.htm
MPI: http://www-unix.mcs.anl.gov/mpi/
mpich: http://www-unix.mcs.anl.gov/mpi/mpich1/



3.3. Operating DiFX 

DiFX is controlled via an interactive Graphical User Interface (GUI), which calls the 
various component programs and helper scripts. The primary purpose of the GUI is to 
facilitate easy editing of the text files which configure the correlator, run external programs 
such as the delay model generator, and provide feedback while a job is running. Two files 
are necessary to run the actual correlator program. The first is an experiment configuration 
file, containing tables of stations, frequency setups, etc., analogous to a typical hardware
correlator job configuration script. The second file contains the list of compute nodes on 
which the correlator program will run. 

While it is possible to run all tasks required to operate the correlator manually, in
practice they are organised via the GUI. This consists of running a series of helper applications
from the GUI to generate the necessary input for the correlator. These include a script to
extract experiment information from the VLBI exchange (VEX) file used to configure and
schedule the telescopes at observe time, a delay and (u,v,w) generator which makes use of
CALC 9, and scripts to extract the current load of available nodes. Pulsar-specific
information such as pulse profiles and bin settings can also be loaded. This information is presented
via the GUI and adjustments to the configuration, such as the selection of computational resources
to be used, can be made before launching a correlation job.

In the future it is planned to incorporate some real-time feedback of amplitude, phase 
and lag information from the current correlation via the GUI. This would be similar to the 
visibility spectra displays available continuously at connected-element interferometers.

3.4. Performance

In order to keep every compute node used in the correlation fully loaded, they must be
kept supplied with raw data. If this condition is satisfied, we have a CPU-limited correlation,
and the addition of further nodes will result in a linear performance gain. In practice,
however, at some point obtaining data from the data source (network socket or disk) and
transmitting it across the local network to the processing nodes will no longer occur quickly
enough, and the correlation becomes data-limited rather than CPU-limited. Correct selection
of correlation parameters, and good cluster design, will minimise the networking overhead
imposed on a correlation job, and ensure that all compute nodes are fully utilised. This is
discussed in §3.4.1 below, and performance profiles for the CPU-limited case are presented
in §3.4.2.

3.4.1. Networking considerations

As described in §3.2, double-buffered communications to the processing nodes are used
to ensure that nodes are never idle as long as sufficient aggregate networking capability is
available. The use of MPI communications adds a small but unavoidable overhead to data
transfer, meaning the maximum throughput of the system is slightly less than the maximum
network capacity on the most heavily loaded data path.

There are two significant data flows: out of each Datastream and into the FxManager.
For any high speed correlation, there will be more Core nodes than Datastream nodes, so
the aggregate rate into a Core will be lower than that out of a Datastream. The flow out of
a Core is a factor of N_cores times lower than that into the FxManager node.

If processing in real time (when processing time equals observation time), the rate 
out of each Datastream will be equal to the recording rate, which can be up to 1 Gbps with 
modern VLBI arrays and is within the capabilities of modern commodity ethernet equipment. 
The rate into the FxManager node will be equal to the product of the recording rate, the 
compression ratio, and the number of Cores, where the compression ratio is the ratio of data 
into a Core to data out of a Core. This is determined by the number of antennas (since the
number of baselines scales with the number of antennas squared), the number of channels in
the output cross-power spectrum, the number of polarisation products correlated, and the 
integration time used before sending data back to the FxManager node. 
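
To give a feel for the quantities involved, the sketch below estimates the visibility data rate returned to the FxManager and the corresponding compression ratio, defined as above as the ratio of data into the Cores to data out of the Cores. It is a back-of-envelope estimate under stated assumptions: cross-correlations only, 8 bytes per complex visibility (two 32-bit floats), and no header or MPI overhead. The function and parameter names are invented for this example.

```python
def n_baselines(n_antennas):
    """Number of cross-correlation baselines for an array of n_antennas."""
    return n_antennas * (n_antennas - 1) // 2

def visibility_rate_bps(n_antennas, n_bands, n_channels, n_pol_products,
                        integration_s, bytes_per_visibility=8):
    """Approximate rate (bits/s) of visibilities returned to the manager.

    n_bands counts frequency bands, each yielding n_pol_products spectra
    of n_channels points per baseline. bytes_per_visibility=8 is an assumption.
    """
    points = n_baselines(n_antennas) * n_bands * n_channels * n_pol_products
    return points * bytes_per_visibility * 8 / integration_s

def compression_ratio(n_antennas, recording_rate_bps, output_rate_bps):
    """Ratio of aggregate baseband data in to visibility data out."""
    return n_antennas * recording_rate_bps / output_rate_bps

# Example: 10 antennas at 256 Mbps, 4 dual-polarisation bands, 128 channels,
# all four polarisation products, 1 second integrations.
out_bps = visibility_rate_bps(10, 4, 128, 4, 1.0)
ratio = compression_ratio(10, 256e6, out_bps)
print(f"return rate {out_bps / 1e6:.2f} Mbps, compression ratio {ratio:.0f}")
```

Shortening the integration time or increasing the number of channels raises the return rate in direct proportion, which is why very high time and spectral resolution on large arrays stresses the link into the FxManager.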

It is clearly desirable to maximise the size of data messages sent to a Core for processing,
since this minimises the data rate into the FxManager node for a given number of Cores.
However, if the messages are too large, performance will suffer as RAM capacity is exceeded.
Network latency may also become problematic, even with buffering. Furthermore, it should
be apparent that in this architecture, the Cores act as short-term accumulators (STAs),
with the manager performing the long term accumulation. The length of the STA sets
the minimum integration time. It is important to note, however, that the STA interval is
entirely configurable in the software correlator, to be as short as a single FFT, although
network bandwidth and latency are likely to be limiting factors in this case.
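
The division of labour between short-term accumulation on the Cores and long-term accumulation on the manager can be sketched with a toy example. Here a thread pool stands in for the Core nodes and the "correlation" is reduced to a zero-lag sum of cross products; the real system uses MPI and FX processing, so this is only a schematic of the accumulation pattern, with all names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def short_term_accumulate(segment):
    """Core role: accumulate cross products over one STA-length segment."""
    x, y = segment
    return sum(a * b.conjugate() for a, b in zip(x, y))

def correlate(x, y, sta_len, n_cores=4):
    """Manager role: hand out segments, then sum the returned STA results."""
    segments = [(x[i:i + sta_len], y[i:i + sta_len])
                for i in range(0, len(x), sta_len)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # pool.map preserves segment order, one result per STA interval.
        sta_results = list(pool.map(short_term_accumulate, segments))
    return sum(sta_results), sta_results

# Two short synthetic complex sample streams.
x = [complex(k % 3, 1) for k in range(64)]
y = [complex(1, k % 2) for k in range(64)]
total, sta_results = correlate(x, y, sta_len=16)
print(len(sta_results), total)
```

A longer `sta_len` means fewer, larger messages back to the manager, at the cost of a coarser minimum integration time.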

For the majority of experiments it is possible to set an STA length which satisfies all the
network criteria and allows the Cores to be maximally utilised. For combinations of large 
numbers of antennas and very high spectral and time resolution, however, it is impossible 
to set an STA which allows a satisfactorily low return data rate to the FxManager node. In 
this case, real time processing of the experiment is not possible without the installation of 
additional network and/or CPU capacity on the FxManager node. 

It is important to emphasise that although it is possible to find experimental configura- 
tions for which the software correlator suffers a reduction in performance, these configurations 
would be impossible on existing hardware correlators. If communication to the FxManager 
node is limiting performance, it is also possible to parallelize a disk-based experiment by 
dividing an experiment into several time ranges and processing these time ranges simultane- 
ously, allowing an aggregate processing rate which equals real time. This is actually one of 
the most powerful aspects of the software correlator, and one which would allow scheduling 
of correlation to always ensure the cluster was being fully utilised. 
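
Splitting a disk-based experiment into time ranges is straightforward to express. The sketch below assumes that a single job runs at some known fraction of real time and that the jobs do not compete for the same resources; the function names are hypothetical.

```python
import math

def jobs_for_real_time(speed_fraction):
    """Number of simultaneous jobs needed for an aggregate real-time rate.

    speed_fraction is the processing speed of one job relative to real time
    (e.g. 0.25 means one job takes four times the observation length).
    """
    return math.ceil(1.0 / speed_fraction)

def split_time_range(start_s, stop_s, n_jobs):
    """Divide [start_s, stop_s] into n_jobs contiguous, equal time ranges."""
    length = (stop_s - start_s) / n_jobs
    edges = [start_s + i * length for i in range(n_jobs)] + [stop_s]
    return list(zip(edges[:-1], edges[1:]))

# Example: a 20 minute scan processed by jobs that each run at 0.25x real time.
n_jobs = jobs_for_real_time(0.25)
print(split_time_range(0.0, 1200.0, n_jobs))
```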

3.4.2. CPU-limited performance

Figure [2] shows the results of performance testing on the Swinburne cluster (using the 
3.2 GHz Pentium 4 machines and the gigabit ethernet network) for different array sizes and 
spectral resolutions. The results shown in Figure [2] were obtained for data for which the 
aggregate bandwidth was 64 MHz, broken up into 8 bands each of 8 MHz bandwidth (4 ×
dual polarisation 8 MHz bands; data were 2-bit sampled; antenna data rate 256 Mbps).
Node requirements for real-time operation are extrapolated from the compute time on an 
8 node cluster. The correlation integration time is 1 second and all correlations provide all 
four polarisation products. RAM requirements per node ranged from 10 - 50 MB depending 
on spectral resolution, showing that large amounts of RAM are unnecessary for typical 
correlations. It can be seen that even a modestly sized commodity cluster can process a 
VLBI-sized array in real time at currently available data rates. 
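
The extrapolation from a small test cluster to the node count needed for real-time operation amounts to simple scaling, provided the correlation remains CPU-limited so that performance grows linearly with the number of nodes. A minimal sketch follows; the timing figures in the example are hypothetical and are not measurements from Figure [2].

```python
import math

def nodes_for_real_time(test_nodes, compute_time_s, observation_time_s):
    """Nodes needed to process data as fast as it was observed.

    Assumes CPU-limited operation, i.e. compute time scales inversely
    with the number of nodes.
    """
    return math.ceil(test_nodes * compute_time_s / observation_time_s)

# Hypothetical example: 60 s of data takes 750 s to correlate on 8 nodes.
print(nodes_for_real_time(8, 750.0, 60.0))
```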

3.5. Correlator comparison results

3.5.1. Comparison with ATNF S2 correlator 

Observations to provide data for a correlator comparison between the Swinburne soft- 
ware correlator and the ATNF S2 correlator were undertaken on March 12, 2006, with the 
following subset of the LBA: Parkes (64 m), ATCA (phased array of 5 × 22 m), Mopra (22
m), Hobart (26 m). 

Data from these observations were recorded simultaneously to S2 tapes and the LBADR
disks (Phillips et al. 2007, in preparation) during a 20 minute period, UT 02:30-02:50,
corresponding to a scan on a bright quasar (PKS 0208-512). The data recorded corresponded
to two 16 MHz bands, right circular polarisation (RCP), in the frequency ranges 2252 -
2268 MHz and 2268 - 2284 MHz.



The data recorded on S2 tapes were shipped to the ATNF LBA S2 correlator (Roberts
1997) at ATNF headquarters and processed. The data recorded to LBADR disks were
shipped to the Swinburne University of Technology supercomputer and processed using the
software correlator.

At both correlators identical T_sys values in Jy were specified for each antenna and applied
in order to produce nominally calibrated visibility amplitudes. Further, both correlators used 
identical clock models, in the form of a single clock offset and linear rate as a function of time 
per antenna. Finally, the data were processed at each correlator using 2 second correlator 
integration times and 32 spectral channels across each 16 MHz band. 

Different implementations of the CALC-based delay generation were used at each corre- 
lator, meaning small differences exist in the delay models used, leading to differences in the 
correlated visibility phase. We have calculated the delay model differences and subtracted 
the phase due to differential delay model in the following discussion. 

From both correlators, RPFITS format data were output and loaded into the MIRIAD
software (Sault, Teuben, & Wright 1995) for inspection and analysis. The data from the two
correlators are compared in a series of Figures below (Figures [3]-[5]).

Figure [3] shows the visibility amplitudes for all baselines from both correlators as a
function of time, over the period 02:36:00 - 02:45:00 UT, for one of the 16 MHz bands (2252
- 2268 MHz). These amplitudes represent the vector averaged data over the frequency
channel range 10 - 21 (to avoid the edges of the band). The data for each baseline were
fit to a first order polynomial model ($S(t) = \dot{S}\,t + S_0$, where S is the flux density in Jy, t
is the offset in seconds from UT 02:40:30, and $S_0$ is the extrapolated flux density at time
UT 02:40:30) using a standard linear least squares routine. The root mean square (RMS)
variation around the best fit model was calculated for each baseline. The fitted models are
shown in Figure [3] and show no significant differences between the S2 correlator and the
software correlator. Further, the calculated RMS for each baseline agrees very well between
DiFX and the S2 correlator, as summarised in Table [3].
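
For reference, the kind of fit described here (a first order polynomial in time plus the RMS scatter about it) can be written in a few lines. This is a generic ordinary least squares sketch, not the routine used for the comparison, and the sample values are synthetic.

```python
import math

def linear_fit(t, s):
    """Ordinary least squares fit of s = slope * t + offset."""
    n = len(t)
    t_mean = sum(t) / n
    s_mean = sum(s) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ts = sum((ti - t_mean) * (si - s_mean) for ti, si in zip(t, s))
    slope = s_ts / s_tt
    offset = s_mean - slope * t_mean  # value of the model at t = 0
    return slope, offset

def rms_about_fit(t, s, slope, offset):
    """Root mean square of the residuals around the fitted line."""
    residuals = [si - (slope * ti + offset) for ti, si in zip(t, s)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Synthetic amplitudes (Jy) at time offsets (s) from the reference epoch.
t = [-4.0, -2.0, 0.0, 2.0, 4.0]
s = [1.30, 1.36, 1.33, 1.35, 1.34]
slope, offset = linear_fit(t, s)
print(slope, offset, rms_about_fit(t, s, slope, offset))
```

Measuring t from the middle of the time range, as in the text, makes the fitted offset the model amplitude at the reference epoch.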

Figure [4] shows the visibility phase as a function of time for each of the six baselines in
the array. Again the data represent the vector averaged correlator output over the frequency 
channel range 10 — 21 within the 2252 — 2268 MHz band. As discussed above, small 
differences between the delay models used at each correlator have been taken into account 
as part of this comparison. 

Figure [5] shows a comparison of the visibility amplitudes and phases as a function of
frequency in the 2252 - 2268 MHz band. The data represented here result from a vector
average of the two datasets over a two minute time range, UT 02:40:00 - 02:42:00. Since
the S2 correlator is an XF-style correlator, it cannot exactly correct fractional sample error
in the same manner as an FX correlator such as DiFX, as the channelisation is performed
after accumulation. The coarse (post-accumulation) fractional sample correction leads to
decorrelation at all points except the band centre, up to a maximum of ~10% at the band
edges on long baselines where the geometric delay changes by a sample or more over an
integration period. We have corrected for this band edge decorrelation in the S2 correlator
amplitudes in Figure [5].



3.5.2. Comparison with the VLBA correlator



Data obtained as part of a regular series of VLBA test observations were used as a
basis for a correlator comparison between the software correlator and the VLBA correlator
(Napier et al. 1994). The observations were made on 2006 August 05 using the Brewster,
Los Alamos, Mauna Kea, Owens Valley, Pie Town, and Saint Croix VLBA stations. One-bit
digitised data sampled at the Nyquist rate for four dual polarisation bands, each of 8 MHz
bandwidth, were recorded using the Mk5 system (Whitney 2003). The four bands were at
centre frequencies of 2279.49, 2287.49, 2295.49, and 2303.49 MHz. The experiment code for
the observations was MT628 and the source observed was 0923+392, a strong and compact
active galactic nucleus. Approximately two minutes of data recorded in this way were used
for the comparison.



The Mk5 data were correlated on the VLBA correlator and exported to FITS format
files. The data were also shipped to the Swinburne supercomputer and correlated using the
software correlator, with the correlated data exported to RPFITS format files. In both cases, no
scaling of the correlated visibility amplitudes by the system temperatures was made at the
correlators. The visibilities remained in the form of correlation coefficients for the purposes
of the comparison, i.e. a system temperature of unity was used to scale the amplitudes. Each
8 MHz band was correlated with 64 spectral points, and an integration time of 2.048 seconds
was used.

The VLBA correlator data were read into AIPS using FITLD with the parameter
DIGICOR=1. The DIGICOR parameter is used to apply certain scalings to the visibility
amplitudes for data from the VLBA correlator. Further, to obtain the most accurate
scaling of the visibility amplitudes, the task ACCOR was used to correct for imperfect
sampler thresholds, deriving corrections to the antenna-based amplitudes of ~0.5%. These
ACCOR corrections were applied to the data and the data were written to disk in FITS
format.

The software correlator data were read directly into AIPS and then written to disk in 
the same FITS format as the VLBA correlator data. No corrections to amplitude or phase 
of the software correlated data were made in AIPS. 

The VLBA correlator data and the software correlator data were both imported into 
MIRIAD for inspection and analysis, using the same software as used for the comparison 
with the LBA correlator described above. RCP from the 2283.49 - 2291.49 MHz band over 
the time range UT 17:49:00 — 17:51:00 was used in all comparison plots below. 

Since the delay models used by the VLBA and software correlators differ at the
picosecond level, as is the case for the comparison with the LBA data in §3.5.1, differences in
the visibility phase exist between the correlated datasets. As with the LBA comparison, we
have compensated for the phase error due to the delay model differences in the following
comparison.

Figure [6] shows the visibility amplitudes for all baselines from both correlators as a
function of time. These amplitudes represent the vector averaged data over the frequency
channel range 10 - 55 (to avoid the edges of the band). The data for each baseline were fit to
a first order polynomial model ($S(t) = \dot{S}\,t + S_0$, where S is the correlation coefficient, t is the
offset in seconds from UT 17:50:00, and $S_0$ is the extrapolated correlation coefficient at time
UT 17:50:00) using a standard linear least squares routine. The root mean square (RMS)
variation around the best fit model was calculated for each baseline. The fitted models
are shown in Figure [6] and show no significant differences between the VLBA correlator
and the software correlator. Further, the calculated RMS for each baseline agrees very well
between the VLBA correlator and the software correlator. The results of the comparison are
summarised in Table [4].

Figure [7] shows the visibility phase as a function of time for each of the fifteen baselines in
the array. Again the data represent the vector averaged correlator output over the frequency 
channel range 10 — 55 within the band. As discussed above, small differences between the 
delay models used at each correlator cause phase offsets between the two correlators, and 
have been taken into account as part of this comparison. 

Figure [8] shows a comparison of the visibility amplitudes and phases as a function of
frequency in the band. The data represented here result from a vector average of the two
datasets over a two minute time range. Figures [6], [7] and [8] show that the results obtained
by the VLBA correlator and DiFX agree to within the RMS errors of the visibilities in each 
case, as expected. 



4. Scientific applications of the Swinburne software correlator 

4.1. High frequency resolution spectral line VLBI 

As mentioned in the introduction, an attractive feature of software correlation is the ease
with which very high spectral resolution correlation can be undertaken. This is particularly
useful for studies of spectral line sources such as masers when mapping the distribution of the
masing regions and their kinematics, e.g. near black holes in galactic nuclei (Greenhill et al.
1995).



Figure [9] shows a spectrum obtained from an LBA observation of the OH maser G345-0.2.
These observations were made with an array consisting of the ATCA (phased array of 5 × 22
m), Parkes (64 m), and Mopra (22 m), recording data from a dual-polarised (RCP and LCP) 
4 MHz band onto hard disk. The data were correlated using the software correlator with 
16,384 frequency channels across the 4 MHz band, corresponding to 0.25 kHz per channel 
or 0.038 km/s velocity resolution at 1.72 GHz. 

These results compare with recent very high spectral resolution work done with the
VLBA. Fish et al. (2006) observed OH masers with the VLBA, using a 62.5 kHz bandwidth
and 512 channels across this band to obtain channel widths of 0.122 kHz or 0.02 km/s velocity
resolution. The velocity resolution of this correlated dataset is almost twice as good as that
shown in Figure [9]. However, the VLBA bandwidth is only 0.016 times the bandwidth of the
observations shown in Figure [9].



If required, DiFX could have correlated these data with 32,768 channels, 65,536 channels
or even higher numbers of channels. As mentioned in the introduction, the only penalty is
compute time on a resource with a fixed number of processing elements. DiFX therefore has
a clear advantage over existing hardware correlators in terms of producing very high spectral
resolution over wide bandwidths. This capability is useful if the velocity distribution of an
ensemble of masers in a field is broad and cannot be contained in a single narrow bandwidth.



4.2. Correlation for wide fields of view

An application that takes advantage of the frequency and time resolution of the software 
correlator output is wide field imaging. To image a wide field of view, avoiding the effects 
of time and bandwidth smearing, high spectral and temporal resolution is required in the 
correlator visibility output. For example, at VLBI resolution (40 mas), to image the full 
primary beam of an Australia Telescope Compact Array (ATCA) antenna (22 m diameter) 
at a frequency of 1.4 GHz, requires a time resolution of the correlator output of 50 ms and a 
frequency resolution of 4 kHz (allowing a 0.75% smearing loss at the FWHM of the primary
beam).

Neither the JIVE nor the VLBA hardware correlators can achieve such high frequency 
or time resolution for continuum experiments, but DiFX can be configured for such modes 
in an identical manner to a normal continuum experiment. 



4.3. Pulsar studies 



As compact sources with high velocities, pulsars make excellent testbeds with which
to probe the structure of the interstellar medium (ISM). Scintillation due to structure in a
scattering screen between the observer and the pulsar causes variations in the interferometric
visibilities, which have some dependence on time and frequency (e.g. Hewish et al. 1985).
Naturally, pulsar binning is advantageous in these studies for maximising signal to noise
ratios.

The most stringent requirement for useful studies of pulsar scintillation, however, is that
of extremely high frequency resolution. Brisken et al. (2007, in preparation) have recently
demonstrated the capabilities of DiFX for this type of analysis with observations of the
pulsar B0834-04. The NRAO Green Bank Telescope (100 m), Westerbork (14 × 25 m),
Jodrell Bank (76 m), and Arecibo (305 m) were used to provide an ultra-sensitive array at
327 MHz. The data were recorded using the Mk5 system and correlated on the Swinburne
software correlator. The main requirement on the correlation was 0.25 kHz wide frequency
channels, over the broadest bandwidth available, to maximise signal to noise. For these
observations a 32 MHz band was available. The Swinburne software correlator therefore
correlated the data with 131,072 frequency channels across the band.
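
The channel count follows directly from the band and the required channel width. The short sketch below assumes, as the channel counts quoted in this paper suggest, that the number of channels is rounded up to a power of two; that rounding rule and the function name are assumptions for this example rather than a documented DiFX constraint.

```python
import math

def channels_for_resolution(bandwidth_hz, max_channel_width_hz):
    """Smallest power-of-two channel count giving channels no wider than requested."""
    needed = math.ceil(bandwidth_hz / max_channel_width_hz)
    n_channels = 1
    while n_channels < needed:
        n_channels *= 2
    return n_channels

# 0.25 kHz channels across a 32 MHz band.
n = channels_for_resolution(32e6, 250.0)
print(n, 32e6 / n)  # channel count and resulting channel width in Hz
```

The same calculation for a 4 MHz band with 0.25 kHz channels gives the 16,384 channels used for the maser observations in §4.1.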

No existing hardware correlator can provide such a high frequency resolution over such a
wide bandwidth. Full details of the interpretation of the B0834-04 software correlated data
will be available in Brisken et al. (2007, in preparation). Shown in Figure [10] is a section
of the dynamic spectrum from this observation which shows the scintillation structure as
functions of time and frequency.



4.4. Geodetic VLBI 

In addition to astronomical VLBI, the software correlator can also be deployed for 
geodetic VLBI. Compared to astronomical VLBI, geodetic VLBI has additional requirements, 
including different output formats and the frequent use of sub-arraying. The flexibility and 
capabilities of the software correlator are well-matched to this task. 

The software correlator has been tested on geodetic datasets obtained using the Mk5
recording system, consisting of 16 frequency bands. These tests form the basis of a geodetic
correlation comparison between the software correlator and the geodetic correlator of the
Max Planck Institute for Radio Astronomy in Bonn, Germany. Full results of this correlator
comparison will be reported elsewhere (Tingay et al. 2007, in preparation).



In particular, in Australia a new three-station geodetic VLBI array has been funded 
as part of the geospatial component of the Federal Government's National Collaborative 
Research Infrastructure Scheme (NCRIS). This scheme provides for three new geodetic VLBI 
stations of 12 m diameter, Mk5 recording systems, and a modified version of the software 
correlator described in this paper. The modifications necessary to convert DiFX into a 
geodetic correlator consist of the addition of phase calibration tone extraction, a streamlined 
interface to scan-by-scan correlation for sub-arraying, and a capability to produce visibilities 
in a format convenient for geodetic post-processing. 

The new Australian geodetic VLBI array will participate in global geodetic observations, 
as well as undertaking experiments internal to the Australian tectonic plate. 



5. Conclusions 



In this paper we have outlined the main benefits of software correlation for small to
medium sized VLBI arrays. They are:

• The development of software correlation is rapid and does not depend on an intimate
knowledge of digital signal processing hardware, just the algorithms;

• The software is flexible and scalable to accommodate a very broad range of interfero- 
metric modes of observation, including many which cannot be supported by existing 
ASIC-based hardware correlators. Software correlators are therefore ideal for novel 
experiments with very special requirements. The main trade-off for improved perfor- 
mance with a software correlator is the increase in compute time for a fixed number of 
processing elements, or the addition of extra processing elements; 

• The software can easily incorporate data recorded using mixed disk-based recording 
hardware; 

• Medium to large multi-processor computing facilities are available at almost all uni- 
versity and government research institutions, allowing users easy entry into VLBI cor- 
relation; 

• The correlation algorithm is highly parallel and very well suited to a parallel multi- 
processor computing environment; 

• The cost of commodity computing continues to fall with time, making large parallel 
computing facilities more powerful and less expensive; 

• Once written, the code can be ported to a wide range of platforms and recompiled 
with minimal effort. 



We have discussed the implementation of the DiFX software correlator on a standard 
Beowulf cluster at the Swinburne University of Technology and have provided performance 
figures-of-merit for this implementation, showing that relatively large numbers of telescopes 
and relatively high data rates can be correlated in "real-time" using numbers of machines 
that do not exceed the capabilities of moderate to large Beowulf clusters. Clear trade-offs are
possible in many areas of performance. For example, if real-time operation is not important 
it is possible to dramatically reduce the number of processing elements. 

We have also shown the results of comprehensive testing of the software correlator,
comparing its output to that of two established hardware correlators, the S2 correlator of
the Australian Long Baseline Array, operated by the ATNF, and the VLBA correlator. The 
correlator comparisons of visibility amplitude and phase as functions of time and frequency 
verify that DiFX is operating correctly for astronomical VLBI observations. 

DiFX now supports all Australian VLBI observations and some global VLBI experiments,
at data rates up to 1 Gbps per telescope. The DiFX code can be downloaded from
http://astronomy.swin.edu.au/~adeller/software/difx/. A number of scientific programs
have already been supported by the software correlator and are briefly discussed here.
Further, a modified version of the software correlator will be used to support a new VLBI
array in Australia, dedicated to local and global geodetic observations.

This work has been supported by the Australian Federal Government's Major National 
Research Facilities program, the Australian Research Council's (ARC) Strategic Research 

Table 1. Comparison of existing hardware correlator parameters

Correlator  Type  Maximum telescopes        Maximum channels  Minimum integration  Maximum input     Maximum output    Pulsar
                  (in one correlator pass)  (per baseline)    time (ms)            data rate (Mbps)  data rate (MB/s)  binning
VLBA (a)    FX    20                        2048              131.072              256               1                 yes
JIVE (b)    XF    16                        2048 (c)          126 (d)              1024              6 (e)             no
ATNF (f)    XF    6                         8192 (g)          2000                 128               0.064             yes

(a) http://www.vlba.nrao.edu/astro/obstatus/current/node28.html
(b) http://www.jive.nl/correlator/status.html
(c) for up to 8 telescopes
(d) when using half the correlator
(e) data in lag space
(f) http://www.atnf.csiro.au/vlbi/correlator/
(g) 0.5 MHz bandwidth, 2 products

Table 2. Maximum decorrelation incurred due to "Post-F" fringe rotation

Observation                    Max. baseline  Frequency  # channels /   Max. decorrelation
                               (km)           (MHz)      16 MHz band    (%)
LBA low frequency continuum    1400           1600       128            0.003
LBA high frequency continuum   1700           8400       128            0.13
VLBA low frequency continuum   8600           1600       128            0.12
VLBA high frequency continuum  8600           22200      128            21.1
LBA water masers               1700           22200      1024           47.6

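The entries in Table 2 can be approximately reproduced with a short calculation. The sketch below is an illustration under stated assumptions, not necessarily the procedure used to generate the table: it assumes an upper bound on the fringe rate equal to the Earth rotation rate times the baseline length divided by the wavelength, Nyquist-sampled real data (so an FFT yielding N channels across a band of width B spans N/B seconds), and an amplitude loss of 1 - sin(x)/x when the fringe phase is held fixed over one FFT window. The function name and constants are ours.

```python
import math

OMEGA_EARTH = 7.2921e-5   # Earth rotation rate, rad/s
C_LIGHT = 299792458.0     # speed of light, m/s


def post_f_decorrelation(baseline_km, freq_mhz, n_chan, band_mhz=16.0):
    """Approximate worst-case fractional amplitude loss (assumed model)."""
    wavelength_m = C_LIGHT / (freq_mhz * 1e6)
    # Upper bound on the fringe rate in Hz for this baseline length
    fringe_rate_hz = OMEGA_EARTH * baseline_km * 1e3 / wavelength_m
    # One FFT window: 2 * n_chan real samples at 2 * band samples per second
    fft_window_s = n_chan / (band_mhz * 1e6)
    # Half of the fringe phase accumulated across one FFT window
    x = math.pi * fringe_rate_hz * fft_window_s
    return 1.0 - math.sin(x) / x


rows = [
    ("LBA low frequency continuum", 1400, 1600, 128),
    ("LBA high frequency continuum", 1700, 8400, 128),
    ("VLBA low frequency continuum", 8600, 1600, 128),
    ("VLBA high frequency continuum", 8600, 22200, 128),
    ("LBA water masers", 1700, 22200, 1024),
]
for name, baseline_km, freq_mhz, n_chan in rows:
    loss = 100.0 * post_f_decorrelation(baseline_km, freq_mhz, n_chan)
    print(f"{name}: {loss:.3f} %")
```

Under these assumptions the computed losses agree with the tabulated values to within a few tenths of a percentage point; the small residual differences presumably reflect rounding or slightly different constants.
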
Table 3. Linear fit parameters for visibility amplitude vs time for DiFX and the LBA S2 
correlator, with 95% confidence limits

Baseline   Offset_DiFX (Jy)  Offset_LBA (Jy)  Slope_DiFX (μJy s^-1)  Slope_LBA (μJy s^-1)
PKS - NAR  1.341 ± 0.030     1.343 ± 0.028     10 ± 13               14 ± 12
PKS - MOP  3.185 ± 0.058     3.185 ± 0.063     14 ± 24              -11 ± 26
PKS - HOB  2.307 ± 0.058     2.293 ± 0.061    -12 ± 24               -6 ± 24
NAR - MOP  1.616 ± 0.109     1.619 ± 0.114    -27 ± 43              -10 ± 45
NAR - HOB  1.142 ± 0.111     1.139 ± 0.116     -3 ± 44               -5 ± 46
MOP - HOB  2.694 ± 0.256     2.681 ± 0.257     18 ± 101              56 ± 101

Table 4. Linear fit parameters for visibility amplitude (in units of correlation coefficient) 
vs time for DiFX and the VLBA correlator, with 95% confidence limits

Baseline  Offset_DiFX      Offset_VLBA      Slope_DiFX (10^-6 s^-1)  Slope_VLBA (10^-6 s^-1)
BR - LA   0.0104 ± 0.0004  0.0103 ± 0.0005  -0.8 ± 1.7               -0.9 ± 1.7
BR - MK   0.0072 ± 0.0005  0.0071 ± 0.0006   0.1 ± 1.8                0.5 ± 2.0
BR - OV   0.0125 ± 0.0005  0.0124 ± 0.0005  -0.7 ± 1.7               -0.5 ± 1.8
BR - PT   0.0090 ± 0.0004  0.0089 ± 0.0004  -1.0 ± 1.3               -1.2 ± 1.5
BR - SC   0.0069 ± 0.0005  0.0069 ± 0.0005  -3.1 ± 2.0               -2.5 ± 1.8
LA - MK   0.0059 ± 0.0005  0.0059 ± 0.0005   1.9 ± 1.7                1.4 ± 1.7
LA - OV   0.0101 ± 0.0005  0.0100 ± 0.0005   0.4 ± 1.7                0.6 ± 1.7
LA - PT   0.0073 ± 0.0005  0.0072 ± 0.0005  -0.3 ± 1.7               -0.5 ± 1.8
LA - SC   0.0058 ± 0.0004  0.0058 ± 0.0004  -1.8 ± 1.5               -1.9 ± 1.5
MK - OV   0.0078 ± 0.0004  0.0077 ± 0.0005   0.9 ± 1.5                0.3 ± 1.8
MK - PT   0.0044 ± 0.0004  0.0044 ± 0.0004  -0.6 ± 1.7               -0.3 ± 1.5
MK - SC   0.0028 ± 0.0005  0.0028 ± 0.0005  -0.6 ± 1.8               -0.7 ± 1.7
OV - PT   0.0083 ± 0.0005  0.0082 ± 0.0005  -1.8 ± 1.8               -1.9 ± 1.7
OV - SC   0.0062 ± 0.0005  0.0062 ± 0.0005  -0.3 ± 1.8               -0.2 ± 1.8
PT - SC   0.0055 ± 0.0005  0.0055 ± 0.0005  -1.7 ± 2.0               -1.3 ± 1.8

Fig. 1. - Overview of the software correlator architecture. Data are loaded into memory from 
a disk or network connection by Datastream nodes. These nodes are directed by a Master 
node to send data from given time ranges (typically several ms) to the processing elements 
(Core nodes). The processed data are sent to the Master node for long-term accumulation 
and storage on disk. 
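
The data flow in Fig. 1 can be illustrated with a toy, single-process model. This is only a sketch: the class names mirror the node roles in the caption, but the methods, the round-robin assignment and the placeholder arithmetic (a plain sum of sample products standing in for correlation) are our assumptions, and the real correlator distributes these roles across separate cluster nodes.

```python
from collections import defaultdict


class Datastream:
    """Holds one station's samples and serves requested sample ranges."""

    def __init__(self, station, samples):
        self.station = station
        self.samples = samples

    def segment(self, start, stop):
        return self.samples[start:stop]


class Core:
    """Processes one time segment from every station."""

    def process(self, segments):
        # Placeholder for correlation: sum of sample products per station pair
        stations = sorted(segments)
        products = {}
        for i, a in enumerate(stations):
            for b in stations[i + 1:]:
                products[(a, b)] = sum(
                    x * y for x, y in zip(segments[a], segments[b])
                )
        return products


class Master:
    """Assigns segments to cores in turn and accumulates the returned results."""

    def __init__(self, datastreams, cores, segment_len):
        self.datastreams = datastreams
        self.cores = cores
        self.segment_len = segment_len

    def run(self, n_samples):
        accumulated = defaultdict(float)
        starts = range(0, n_samples, self.segment_len)
        for k, start in enumerate(starts):
            stop = min(start + self.segment_len, n_samples)
            core = self.cores[k % len(self.cores)]  # round-robin assignment
            segments = {d.station: d.segment(start, stop) for d in self.datastreams}
            for baseline, value in core.process(segments).items():
                accumulated[baseline] += value  # long-term accumulation
        return dict(accumulated)


streams = [
    Datastream("A", [1, 2, 3, 4, 5, 6]),
    Datastream("B", [1, 0, 1, 0, 1, 0]),
    Datastream("C", [2, 2, 2, 2, 2, 2]),
]
result = Master(streams, [Core(), Core()], segment_len=4).run(n_samples=6)
print(result)
```

Because each segment is processed independently and only the accumulated sums are combined, the result does not depend on how many Core objects share the work, which is the property that makes the task easy to parallelise.
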
Fig. 2. - Benchmark data showing the computational requirements of DiFX to correlate in 
real-time, as described in the text. The nodes are single core 3.2 GHz Pentium processors 
with 1 GB RAM, and in both benchmarks 64 MHz of total bandwidth per station was 
correlated with a 1 second integration period. Top panel shows the scaling of computational 
requirements with number of antennas, using 256 spectral points per 8 MHz subband. Bottom 
panel shows the scaling of computational requirements with spectral points per subband for 
a ten station array. 
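
For orientation, the expected scaling of an FX correlator with antenna count and spectral resolution can be sketched with a simple operation count. This is a rough textbook-style model, not a fit to the benchmarks in Fig. 2: the conventional estimate of about 2.5 n log2(n) floating-point operations for a real FFT of length n, the 8 operations per complex multiply-accumulate, and the default of two polarisation products per subband are all assumptions of ours, and memory, cache and network effects are ignored entirely.

```python
import math


def fx_flops_per_second(n_ant, n_chan, subband_hz=8e6, n_subbands=8,
                        products_per_subband=2):
    """Return (station_flops, baseline_flops) per second of observed data."""
    n_fft = 2 * n_chan                       # real samples per FFT
    ffts_per_s = 2.0 * subband_hz / n_fft    # Nyquist sample rate / FFT length
    # Assumed cost of one real FFT of length n_fft
    fft_flops = 2.5 * n_fft * math.log2(n_fft)
    # Station-based work grows linearly with the number of antennas
    station = n_ant * n_subbands * ffts_per_s * fft_flops
    # Baseline-based work: one complex multiply-accumulate (8 flops assumed)
    # per channel, per product, per baseline, per FFT
    n_baselines = n_ant * (n_ant - 1) // 2
    baseline = (n_baselines * n_subbands * products_per_subband
                * ffts_per_s * n_chan * 8.0)
    return station, baseline


for n_ant in (4, 10, 20):
    station, baseline = fx_flops_per_second(n_ant, n_chan=256)
    print(n_ant, f"{station:.3e}", f"{baseline:.3e}")
```

In this idealised model the station-based term grows linearly with the number of antennas and logarithmically with the number of spectral points, while the baseline-based term grows quadratically with the number of antennas and is independent of spectral resolution. Measured requirements can depart from this where effects outside the model dominate.
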
[Plot for Fig. 3: panel title NAR - HOB; horizontal axis t (seconds from start of day)]


Fig. 3. - S2 (red) and DiFX (black) visibility amplitude vs time for the 2252 - 2268 MHz 
band on the source PKS 0208-512, as described in the text (PKS = Parkes; MOP = Mopra; 
HOB = Hobart; NAR = ATCA). Symbols represent the actual visibilities produced by the 
correlators, while the lines represent linear least-squares fits to the visibilities (one line per 
dataset). 
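
A linear least-squares fit with 95% confidence limits, of the kind plotted here and tabulated in Table 3, can be sketched as follows. This is a minimal illustration rather than the fitting code used for the paper: it assumes equal weights, independent Gaussian errors, and a normal approximation for the confidence multiplier (adequate for many points; a Student-t multiplier would be wider for short series). The offset returned is the intercept at t = 0 of whatever time axis is supplied, which may differ from the reference epoch used in the tables.

```python
import math
from statistics import NormalDist, fmean


def linear_fit_with_limits(t, y, confidence=0.95):
    """Return (offset, offset_limit, slope, slope_limit) for y = offset + slope * t."""
    n = len(t)
    t_mean, y_mean = fmean(t), fmean(y)
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = s_ty / s_tt
    offset = y_mean - slope * t_mean
    # Residual variance with two fitted parameters
    residual_ss = sum((yi - (offset + slope * ti)) ** 2 for ti, yi in zip(t, y))
    sigma2 = residual_ss / (n - 2)
    slope_err = math.sqrt(sigma2 / s_tt)
    offset_err = math.sqrt(sigma2 * (1.0 / n + t_mean ** 2 / s_tt))
    # Two-sided multiplier under the normal approximation (assumption)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return offset, z * offset_err, slope, z * slope_err


# Hypothetical amplitudes, for illustration only
times = [0.0, 1.0, 2.0, 3.0]
amps = [1.0, 3.0, 2.0, 4.0]
offset, offset_lim, slope, slope_lim = linear_fit_with_limits(times, amps)
print(f"offset = {offset:.3f} +/- {offset_lim:.3f}")
print(f"slope  = {slope:.3f} +/- {slope_lim:.3f}")
```

Two correlators can then be compared by checking whether the fitted offsets and slopes for the same baseline agree within their confidence limits.
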



[Plot for Fig. 4: horizontal axis t (seconds since start of day)]

Fig. 4. - S2 (red) and DiFX (black) visibility phase vs time for the 2252 - 2268 MHz band 
on the source PKS 0208-512, as described in the text. Antenna labels as in Figure 3 above. 
The PKS-NAR baseline has been shifted by -50 deg for clarity. 
Fig. 5. - S2 (red) and DiFX (black) visibility amplitude and phase vs frequency data for 
the 2252 - 2268 MHz band on the source PKS 0208-512, as described in the text. Antenna 
labels as in Figure 3 above. The S2 data have been corrected for fractional-sample error 
decorrelation at the band edges as described in the text. 
Fig. 6. - VLBA correlator (red) and DiFX (black) visibility amplitude vs time for the 
2283.49 - 2291.49 RCP band from the VLBA test observation MT628, as described in the 
text. The units of time are seconds from UT 00:00:00, and the amplitude scale is correlation 
coefficient. Symbols represent the actual visibilities produced by the correlators, while the 
lines represent linear least-squares fits to the visibilities. The text annotation on each panel 
lists the average correlation coefficient amplitude for each correlator over the time period, 
as tabulated in Table 4. 
Fig. 7. - VLBA correlator (red) and DiFX (black) visibility phase vs time for the 2283.49 - 
2291.49 RCP band from the VLBA test observation MT628, as described in the text. The 
units of time are seconds from UT 00:00:00, and phase is displayed in degrees. 
Fig. 8. - VLBA correlator (red) and DiFX (black) visibility amplitude and phase as a 
function of frequency for the 2283.49 - 2291.49 RCP band from the VLBA test observation 
MT628, as described in the text. The vertical scale for correlation coefficient amplitude on 
each panel is 0 - 0.018, while the phase scale spans ±180 deg. The horizontal scale for each 
panel displays channels 0-64. 
Fig. 9. - A two minute average of the ATCA - Parkes cross-power spectrum taken from the 
software correlated data for the OH maser G345-0.2, as described in the text. The velocity 
resolution is 0.038 km/s at the central frequency of 1.72 GHz. The light gray line showing 
strong maser emission represents the LCP data and the dark gray line with little emission 
represents the RCP data. The maser is highly circularly polarised. 
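
The quoted velocity resolution can be related to a spectral channel width through the low-velocity Doppler relation, velocity width = c × (channel width) / (observing frequency). The sketch below applies that relation to the numbers in the caption; the channel width it yields is inferred by us and is not stated in the caption, and the function names are ours.

```python
C_KM_S = 299792.458  # speed of light, km/s


def velocity_resolution_km_s(channel_width_hz, freq_hz):
    """Velocity width of one spectral channel (low-velocity Doppler approximation)."""
    return C_KM_S * channel_width_hz / freq_hz


def channel_width_hz(velocity_res_km_s, freq_hz):
    """Channel width corresponding to a given velocity resolution."""
    return velocity_res_km_s / C_KM_S * freq_hz


# Values from the Fig. 9 caption: 0.038 km/s at 1.72 GHz
width = channel_width_hz(0.038, 1.72e9)
print(f"implied channel width: {width:.0f} Hz")
```

Under this approximation the caption values correspond to channels roughly 0.2 kHz wide.
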







Fig. 10. - The cross-power dynamic spectrum showing scintillation variations for the pulsar 
B0834-04 on the Green Bank Telescope - Arecibo baseline. Brightness represents the 
visibility amplitude and colour represents the visibility phase. Increasing frequency runs left to 
right and increasing time runs top to bottom. This section of the dynamic spectrum 
represents just 5% of the time span and 0.5% of the bandwidth of the observation (330 seconds 
and 160 kHz). 



